home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Tricks of the Mac Game Programming Gurus
/
TricksOfTheMacGameProgrammingGurus.iso
/
Information
/
CSMP Digest
/
volume 1
/
csmp-v1-198.txt
< prev
next >
Encoding:
Amiga
Atari
Commodore
DOS
FM Towns/JPY
Macintosh
Macintosh JP
NeXTSTEP
RISC OS/Acorn
Shift JIS
UTF-8
Wrap
Text File
|
1994-12-08
|
47.2 KB
|
1,298 lines
|
[
TEXT/R*ch
]
C.S.M.P. Digest Wed, 28 Oct 92 Volume 1 : Issue 198
Today's Topics:
Better random number than Random()?
BlockZero.c snippet
The Comp.Sys.Mac.Programmer Digest is moderated by Michael A. Kelly.
The digest is a collection of article threads from the internet newsgroup
comp.sys.mac.programmer. It is designed for people who read c.s.m.p. semi-
regularly and want an archive of the discussions. If you don't know what a
newsgroup is, you probably don't have access to it. Ask your systems
administrator(s) for details. (This means you can't post questions to the
digest.)
Each issue of the digest contains one or more sets of articles (called
threads), with each set corresponding to a 'discussion' of a particular
subject. The articles are not edited; all articles included in this digest
are in their original posted form (as received by our news server at
cs.uoregon.edu). Article threads are not added to the digest until the last
article added to the thread is at least one month old (this is to ensure that
the thread is dead before adding it to the digest). Article threads that
consist of only one message are generally not included in the digest.
The entire digest is available for anonymous ftp from ftp.cs.uoregon.edu
[128.223.8.8] in the directory /pub/mac/csmp-digest. Be sure to read the
file /pub/mac/csmp-digest/README before downloading any files. The most
recent issues are available from sumex-aim.stanford.edu [36.44.0.6] in the
directory /info-mac/digest/csmp. If you don't have ftp capability, the sumex
archive has a mail server; send a message with the text '$MACarch help' (no
quotes) to LISTSERV@ricevm1.rice.edu for more information.
The digest is also available via email. Just send a note saying that you
want to be on the digest mailing list to mkelly@cs.uoregon.edu, and you will
automatically receive each new issue as it is created. Sorry, back issues
are not available through the mailing list.
Send administrative mail to mkelly@cs.uoregon.edu.
-------------------------------------------------------
From: hd12@ellis.uchicago.edu (hui dong)
Subject: Better random number than Random()?
Date: 18 Sep 92 10:54:10 GMT
Organization: University of Chicago Computing Organizations
(I did check FAQ before I post this.)
I need a fast (speed is very important) random number generator. Right now
I am using Random(), but it seems to me the distribution is not uniform
at all. Is there a better generator? Will it be faster to use my own
function than to use toolbox routine?
Thanks for the help.
+++++++++++++++++++++++++++
From: fry@tara.harvard.edu (David Fry)
Date: 19 Sep 92 05:20:38 GMT
Organization: Harvard Math Department
In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes:
>(I did check FAQ before I post this.)
>
>I need a fast (speed is very important) random number generator. Right now
>I am using Random(), but it seems to me the distribution is not uniform
>at all. Is there a better generator? Will it be faster to use my own
>function than to use toolbox routine?
>Thanks for the help.
This probably should be in the FAQ list because it comes up all the
time (and it seems I'm always the one to answer it). Random() is a
quite good random number generator, and it is pretty fast. It will
almost certainly be faster than one you write yourself, provided the
one you write yourself is as good.
Are you changing the randSeed seed variable before calling it the
first time? You can use the result of a GetDateTime() call as the
initial seed.
Why do you think it isn't uniform? Are you doing anything strange
with the results of Random(), like using a "mod" to get a smaller
random number? This decreases the quality of your random numbers,
although not as much with Random() as with some other RNG's.
David Fry fry@math.harvard.edu
Division of Applied Sciences fry@huma1.bitnet
Harvard University ...!harvard!huma1!fry
Cambridge, MA 02138
+++++++++++++++++++++++++++
From: jvp@vlsi1.dell.com (Jim Van Peursem)
Date: 19 Sep 92 15:56:43 GMT
Organization: Iowa State University, Ames IA
In <1992Sep19.012039.15704@husc3.harvard.edu> fry@tara.harvard.edu (David Fry) writes:
>>I need a fast (speed is very important) random number generator. Right now
>>I am using Random(), but it seems to me the distribution is not uniform
>>at all. Is there a better generator? Will it be faster to use my own
>>function than to use toolbox routine?
>>Thanks for the help.
>This probably should be in the FAQ list because it comes up all the
>time (and it seems I'm always the one to answer it). Random() is a
>quite good random number generator, and it is pretty fast. It will
>almost certainly be faster than one you write yourself, provided the
>one you write yourself is as good.
What kind of RNG is Random()? Isn't it a linear congruential?
If it is, better quality RNG's exist such as Lagged Fibbonacci.
+---------------------------------------------------------------+
| Jim Van Peursem - Ph.D. Student - Ham Radio -> KE0PH |
| System Administrator - Digital Systems Design Lab |
| Department of Electrical Engineering and Computer Engineering |
| Iowa State University - Ames, IA 50011 |
| internet - jvp@cpre1.ee.iastate.edu |
+---------------------------------------------------------------+
+++++++++++++++++++++++++++
From: sw@network-analysis-ltd.co.uk (Sak Wathanasin)
Date: 19 Sep 92 08:45:28 GMT
Organization: Network Analysis Ltd
In article <1992Sep18.105410.24438@midway.uchicago.edu> (comp.sys.mac.programmer), hd12@ellis.uchicago.edu (hui dong) writes:
> (I did check FAQ before I post this.)
Well, it ought to be in the FAQ, since this comes up every few months.
It is covered in the Usenet Mac Programmer's Guide.
> I need a fast (speed is very important) random number generator. Right now
> I am using Random(), but it seems to me the distribution is not uniform
> at all. Is there a better generator? Will it be faster to use my own
> function than to use toolbox routine?
Ignore the 16-bit result returned by Random(), and use the full 32-bit
value that is in randSeed. Details are in the UMPG. The Random() in rom
is the same as the Park & Miller "minimal standard RNG" described in a
CACM paper, and guards against overflows during intermediate calculations.
You will be hard pushed to write a faster version in C or Pascal, and will
have to use assembler to beat it.
Sak Wathanasin
Network Analysis Limited
178 Wainbody Ave South, Coventry CV3 6BX, UK
uucp: ...!uknet!nan!sw Phone: (+44) 203 419996
AppleLink: NAN.LTD Internet: sw@network-analysis-ltd.co.uk
+++++++++++++++++++++++++++
From: fry@tara.harvard.edu (David Fry)
Date: 19 Sep 92 13:42:12 EDT
Organization: Harvard Math Department
In article <jvp.716918203@vlsi1> jvp@vlsi1.dell.com (Jim Van Peursem) writes:
> What kind of RNG is Random()? Isn't it a linear congruential?
>If it is, better quality RNG's exist such as Lagged Fibbonacci.
Yes, it's linear congruential. Certainly better RNG's exist, but for
more post purposes, Random() is more than good enough. It follows the
so-called "minimal standard" RNG proposed in "Random Number
Generators: Good Ones are Hard to Find" by Park and Miller, in
Communications of the ACM, October 1988. This article details the
history of faulty RNGs and explains why this is a fast and good one.
As someone else pointed out, to really get the best quality from
Random(), you should ignore there result of the function (which is
16-bits), and use the value in randSeed (which is 32-bits) after
calling Random(). In that case you have exactly the RGN described in
the CACM paper. I think even using the 16-bit number is good
enough for most purposes because the lower 16-bits of the RNG are
sufficiently random. Still, I admit to using the full 32-bits in my
work :-).
David Fry fry@math.harvard.edu
Division of Applied Sciences fry@huma1.bitnet
Harvard University ...!harvard!huma1!fry
Cambridge, MA 02138
+++++++++++++++++++++++++++
From: Bruce.Hoult@bbs.actrix.gen.nz
Organization: Actrix Information Exchange
Date: Sun, 20 Sep 1992 04:06:49 GMT
In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes:
> (I did check FAQ before I post this.)
>
> I need a fast (speed is very important) random number generator. Right now
> I am using Random(), but it seems to me the distribution is not uniform
> at all. Is there a better generator? Will it be faster to use my own
> function than to use toolbox routine?
> Thanks for the help.
The toolbox Random() functions is perfectly OK if you use the *seed*
as your random value and ignore the function result. OTOH, it's not
exactly fast because of the Trap call overhead.
I've included my own random number generator, which requires a 68020
or better.
I've also included my attempt at a random generator that produces
"normally" distributed numbers. It's based on the Central Limits
Theorem that says the sum of several independant random variables is
approximately normally distributed, no matter what distribution the
individual variables have. I found that for my purposes the sum of
ten uniformally distributed variables was good enough.
The only mildly clever thing about it is that I do a little parallel
computing by working out the random numbers in the CPU and then adding
them (to get a greater then 32 bit result) in the coprocessor.
Comment is invited on whether this is a good way to generate a normal
distribution, or not.
- -- Bruce
- ---------------
procedure MyRandom(var seed:longint);
inline
$225F, {movea.l (a7)+,a1}
$2011, {move.l (a1),d0}
$4C3C, $0401, $0000, $41A7, {mulu.l #41a7,d1:d0}
$4C7C, $0401, $7FFF, $FFFF, {divu.l #7fffffff,d1:d0}
$2281; {move.l d1,(a1)}
procedure MyNormal(
numToAdd:integer;
var seed:longint;
var result:extended
); external;
- ------------------------------------------------------
machine mc68020
mc68881
MyNormal proc export
link a6,#0
fmove.w #0,fp0
move.l 8(a6),a1 ;pointer to result
move.l 12(a6),a0 ;pointer to seed
move.l (a0),d0
move.w 16(a6),d2 ;number of loops
bra.b loopend
loop mulu.l #16807,d1:d0 ;7^5
divu.l #$7fffffff,d1:d0
fadd.l d1,fp0
move.l d1,d0
loopend dbra d2,loop
move.l d0,(a0) ;update the seed
fmove.x fp0,(a1) ;return the result
unlk a6
move.l (a7)+,a0
add.l #10,a7
jmp (a0)
endproc
end
- --
Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
"Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
hard disk that fits in your pocket!" "Great! Is it PC compatible?"
+++++++++++++++++++++++++++
From: ksand@apple.com (Kent Sandvik)
Date: 20 Sep 92 19:06:23 GMT
Organization: Apple
In article <1992Sep19.012039.15704@husc3.harvard.edu>, fry@tara.harvard.edu
(David Fry) wrote:
> In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes:
> >I need a fast (speed is very important) random number generator. Right now
> >I am using Random(), but it seems to me the distribution is not uniform
> >at all. Is there a better generator? Will it be faster to use my own
> >function than to use toolbox routine?
> >Thanks for the help.
> This probably should be in the FAQ list because it comes up all the
> time (and it seems I'm always the one to answer it). Random() is a
> quite good random number generator, and it is pretty fast. It will
> almost certainly be faster than one you write yourself, provided the
> one you write yourself is as good.
True, at least the Usenet Mac Programmer Guide has a lot of information
about alternate random generator algorithms and code. I've done some
tests myself, and it wasn't really justified to write alternate random
generators, for instance I didn't see much performance improvements
by inlining hex code instead of calling a trap. Check the TRandom C++
class on the developer CD for my silly tests.
Cheers,
Kent/DTS
"You should really just relax" -MST3K
- -------------------
Kent Sandvik (UUCP: ....!apple!ksand; INTERNET: ksand@apple.com)
DISCLAIMER: Private activities on the Net.
+++++++++++++++++++++++++++
From: Bruce.Hoult@bbs.actrix.gen.nz
Date: 21 Sep 92 01:34:32 GMT
Organization: Actrix Information Exchange
In article <ksand-200992120624@wintermute.apple.com> ksand@apple.com (Kent Sandvik) writes:
> I've done some
> tests myself, and it wasn't really justified to write alternate random
> generators, for instance I didn't see much performance improvements
> by inlining hex code instead of calling a trap. Check the TRandom C++
> class on the developer CD for my silly tests.
What machine were you testing on?
Whn I wrote the code I posted last night I was on a IIcx and there was
a *massive* difference in trap overhead vs procedure call overhead.
The difference seems to have gotten much smaller on the '040 -- in a
test a while ago on the Q700 I found that a function call and return
with three longword parameters and a trivial body (like SetPt) took just
over 1 uS while a similar trap call took 3 uS. (I hope i've
remembered those right :-)
- --
Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
"Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
hard disk that fits in your pocket!" "Great! Is it PC compatible?"
+++++++++++++++++++++++++++
From: ksand@apple.com (Kent Sandvik)
Date: 22 Sep 92 02:02:51 GMT
Organization: Apple
In article <1992Sep21.013432.18668@actrix.gen.nz>,
Bruce.Hoult@bbs.actrix.gen.nz wrote:
> Whn I wrote the code I posted last night I was on a IIcx and there was
> a *massive* difference in trap overhead vs procedure call overhead.
> The difference seems to have gotten much smaller on the '040 -- in a
> test a while ago on the Q700 I found that a function call and return
> with three longword parameters and a trivial body (like SetPt) took just
> over 1 uS while a similar trap call took 3 uS. (I hope i've
> remembered those right :-)
When I think about it my tests were on a Quadra 900. I will do
some more additional tests on other machines later.
Kent
"You should really just relax" -MST3K
- -------------------
Kent Sandvik (UUCP: ....!apple!ksand; INTERNET: ksand@apple.com)
DISCLAIMER: Private activities on the Net.
---------------------------
From: mgleason@cse.unl.edu (Mike Gleason)
Subject: BlockZero.c snippet
Date: 15 Sep 92 18:50:19 GMT
Organization: NCEMRSoft
I think J. McCarthy & M. Mora started a good idea by posting code snippets here,
so here's a freebie function that clears an area of memory (which can start
on an odd boundary). It's much faster than using memset(ptr,0,n) because
this function uses move.l's for the majority of the block. It compiles
under Think C, but the asm {} blocks may break other compilers.
I use this function all the time, especially when dealing with parameter
blocks. e.g.:
#define ZERO(a) BlockZero(&(a), sizeof (a))
{
HParamBlockRec pb;
ZERO(pb);
}
Code follows. Further optimizations, your cool code snippets, bug fixes
<gulp> welcome....
- ----------------
void BlockZero(void *area, register unsigned long n);
void BlockZero(void *area, register unsigned long n)
{
register char *p = (char *)area;
register unsigned long fours;
register unsigned long remainder, zero;
if (n) {
if ((long) p & 01) {
/* Ptr on an odd boundary... */
asm {
clr.b (p)+
}
--n;
/*
* Ptr is now on an even boundary,
* and it's first byte is zeroed.
*/
}
fours = n / 4L; /* Can be zero. */
/* Do the meat of the block using longwords for speed. */
asm {
clr.l zero
bra @2
@1: move.l zero, (p)+
@2: subq.l #1, fours
tst.l fours
bge @1
}
/*
* If the size of the block isn't divisible by 4, we'll
* have to do the remaining bytes (up to 3) by hand.
*/
remainder = n & 03; /* same as remainder = n % 4. */
asm {
bra @4
@3: clr.b (p)+
@4: dbra remainder, @3
}
}
} /* BlockZero */
/* eof BlockZero.c */
- --
- --mg mgleason@cse.unl.edu
+++++++++++++++++++++++++++
From: keith@taligent.com (Keith Rollin)
Date: 16 Sep 92 17:55:14 GMT
Organization: Taligent
In article <1992Sep15.185019.22483@unlinfo.unl.edu>, mgleason@cse.unl.edu
(Mike Gleason) wrote:
>
> I think J. McCarthy & M. Mora started a good idea by posting code snippets here,
> so here's a freebie function that clears an area of memory (which can start
> on an odd boundary). It's much faster than using memset(ptr,0,n) because
> this function uses move.l's for the majority of the block. It compiles
> under Think C, but the asm {} blocks may break other compilers.
>
> [ code deleted ]
>
True, not only will this not work under MPW (which doesn't have inline
asm), but MPW's memset library routine is better than the one you post.
Specifically, for large enough blocks, it uses a series of 4 MOVE.L
instructions in a row for setting the block. On the other hand, I don't
know what the timing differences between MOVE.L and CLR.L are, so perhaps a
specialized bzero() based on MPW's algorithm would be nice.
- -----
Keith Rollin
Phantom Programmer
Taligent, Inc.
+++++++++++++++++++++++++++
From: nagel@Cigna.COM (Mark Nagel)
Date: 16 Sep 92 23:09:16 GMT
Organization: CIGNA FIRST, Irvine, CA
In <keith-160992104732@kip-58.taligent.com> keith@taligent.com (Keith Rollin) writes:
>instructions in a row for setting the block. On the other hand, I don't
>know what the timing differences between MOVE.L and CLR.L are, so perhaps a
>specialized bzero() based on MPW's algorithm would be nice.
I'm curious about this. I picked up a Motorola 68K family processor
manual, but it doesn't have instruction timing info. I have noticed
that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L
D0. I gather from this that MOVEQ.L is faster, but I'd still like
to see timings. Anyone know a good assembler book for the 68K that
has info like that? Something along the lines of "Zen and the Art
of Assembly Language Programming" would be wonderful.
Mark
- --
Mark Nagel <nagel@cigna.com> | DISCLAIMER: Any resemblence of these
Network Administrator | opinions to those of CIGNA are purely
CIGNA/FIRST | coincidental.
(Don't start vast projects with half-vast ideas)
+++++++++++++++++++++++++++
From: gurgle@netcom.com (Pete Gontier)
Date: 17 Sep 92 02:47:01 GMT
Organization: cellular
nagel@Cigna.COM (Mark Nagel) writes:
>I'm curious about this. I picked up a Motorola 68K family processor
>manual, but it doesn't have instruction timing info. I have noticed
>that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L
>D0. I gather from this that MOVEQ.L is faster, but I'd still like
>to see timings.
Notice, though, that MOVEQ has very limited destination operand types.
If I remember correctly, registers only. So that renders its use in
the context of bzero( ) pretty dubious. Me, I just call NewHandleClear
in the first place and assume Apple's Cray can write better 68000
than me. :-)
- --
Pete Gontier // EC Technology // gurgle@netcom.com
+++++++++++++++++++++++++++
From: Bruce.Hoult@bbs.actrix.gen.nz
Date: 17 Sep 92 05:42:50 GMT
Organization: Actrix Information Exchange
In article <1992Sep15.185019.22483@unlinfo.unl.edu> mgleason@cse.unl.edu (Mike Gleason) writes:
> I think J. McCarthy & M. Mora started a good idea by posting code snippets here,
> so here's a freebie function that clears an area of memory (which can start
> on an odd boundary). It's much faster than using memset(ptr,0,n) because
> this function uses move.l's for the majority of the block. It compiles
> under Think C, but the asm {} blocks may break other compilers.
You should add in a possible 2 byte move at the start and end to align
the pointer on a longword boundary. This will increase speed on the
32-bit bus machines.
You should unroll the main loop to do several move.l's each time around.
You should use dbra for the loop control (only two bytes longer than your
version and much faster)...
bra @3
@1 swap fours
@2 move.l zero,(p)+ # or several of them..
@3 dbra fours, @2
swap fours
dbra fours, @1
- --
Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
"Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
hard disk that fits in your pocket!" "Great! Is it PC compatible?"
+++++++++++++++++++++++++++
From: Bruce.Hoult@bbs.actrix.gen.nz
Date: 17 Sep 92 10:08:16 GMT
Organization: Actrix Information Exchange
In article <1992Sep16.230916.18166@Cigna.COM> nagel@Cigna.COM (Mark Nagel) writes:
> I'm curious about this. I picked up a Motorola 68K family processor
> manual, but it doesn't have instruction timing info. I have noticed
> that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L
> D0. I gather from this that MOVEQ.L is faster, but I'd still like
> to see timings. Anyone know a good assembler book for the 68K that
> has info like that? Something along the lines of "Zen and the Art
> of Assembly Language Programming" would be wonderful.
My official Motorola 68000 (M68000UM/AD REV 4 1986) and 68020
(MC68020UM/AD REV 1 1985) books both have tables of timing
information, but the 68020 one is full of disclaimers about how
complex figuring timings is on modern architectures, and in fact I
think motorola quit publishing instruction timing tables. You've
really got to try *your* instruction sequence to find out, unless
you've got really intimate knowledge of the undocumented internals of
the pipeline etc.
- --
Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
"Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
hard disk that fits in your pocket!" "Great! Is it PC compatible?"
+++++++++++++++++++++++++++
From: mgleason@cse.unl.edu (Mike Gleason)
Date: 17 Sep 92 18:29:11 GMT
Organization: NCEMRSoft
Bruce.Hoult@bbs.actrix.gen.nz writes:
>You should unroll the main loop to do several move.l's each time around.
>You should use dbra for the loop control (only two bytes longer than your
>version and much faster)...
I tried this for the main block, but if I recall, dbra's loop counter
was acting like a 16-bit word instead of a nice big longword.
Some other folks have sent me some tips, so I will tweak it more using
these tricks that aren't covered in my VAX Assembler book. I have attempted
to get my hands on a 68000 assembly book for the mac, but so far all the
books mentioned in the FAQ (and the one E. Wylie suggested recently) are
either out of print or my book store can't order them :-(
- --
- --mg mgleason@cse.unl.edu
+++++++++++++++++++++++++++
From: mgleason@cse.unl.edu (Mike Gleason)
Date: 18 Sep 92 05:27:49 GMT
Organization: NCEMRSoft
Here's a much improved BlockZero function. This one now long word aligns,
and instead of using one small move.l loop to clear 4 bytes at a time, this
version clears 144 bytes per loop iteration, then 16 bytes at a time when
the block has less than 144 dirty bytes left, then 1 byte at a time for any
remaining bytes. In other words, it performs OK on small blocks, pretty
good on medium size blocks, and excellent on large blocks. Of course there
aren't many times where one would need to zero a big block, since most big
blocks are pointers that can be cleared the same time they are allocated,
but it still makes a good general-purpose bzero.
Thanks to everyone who responded, especially John Bruner and Michael Hecht.
I'm sure there are more optimizations that could be done, but I'll let the
Gods of the 68000 worry about that :-)
Here it is, it requires Think C to compile.
/* BlockZero.c */
/* Handy macro: */
#define ZERO(block,size) BlockZero(&(block), (long) (size))
void BlockZero(void *area, long n);
#define PTR a0 /* Our copy of 'area.' */
#define COUNT d0 /* Our copy of 'n.' */
/*
* Using registers d1 - d7, and a1 - a5 to hold zeroes for use
* with MOVEM. Could use a6 also if you can trick the inline
* assembler to not use LINK/UNLK.
*/
#define nAddrRegsUsed 5
#define nDataRegsUsed 7
/*
* One MOVEM instruction will clear this many bytes at a time.
*/
#define ZSize ((nAddrRegsUsed + nDataRegsUsed) * sizeof (long))
/*
* This is how many MOVEM instructions you want in your loop. For each
* extra one you add, then large blocks will clear faster; however
* smaller blocks will be slower because it will use the intermediate
* loop, where only 16 bytes are cleared at a time.
*/
#define UnrolledLoops 3
/*
* For each UnrolledLoop, do one of these.
*/
#define Loop movem.l d1-d7/a1-a5, -(PTR)
/*
* This is how many bytes are cleared by all of your MOVEMs before
* we have to decrement the count and start another loop iteration.
*/
#define TotalSize (ZSize * UnrolledLoops)
void BlockZero(void *area, long n)
{
asm {
movem.l d0-d7/a0-a5, -(sp)
move.l n, COUNT ; Could use index from the SP
movea.l area, PTR ; for these instead of a6.
add.l COUNT, PTR ; Start at end of block,
; and go backwards.
move.w PTR, d1 ; Find (area + n) % 4.
andi.w #3, d1
sub.l d1, COUNT ; We will byte-zero this
; many bytes to longword align,
; so subtract this from the
; total.
moveq.l #0, d2
bra @2 ; Zero the last 3 or so
@1: move.b d2, -(PTR) ; bytes to align PTR on a
@2: dbra d1, @1 ; longword boundary.
; Clear out as many registers
; as possible to use with MOVEM.
move.l d2, d1 ; Don't need d1 for scratch.
move.l d2, d3
move.l d2, d4
cmpi.l #TotalSize, COUNT
bls @6 ; If it is a smaller block, then
; jump to the section that clears
; 16 at a time.
; Otherwise we'll need to clear
; out the rest of these registers
; to use in a huge MOVEM.
move.l d2, d5
move.l d2, d6
move.l d2, d7
movea.l d2, a1
movea.l a1, a2
movea.l a1, a3
movea.l a1, a4
movea.l a1, a5
; Now zero as much as we can
; in blocks of 144 (3 movem's
; of 48 bytes each).
bra @4
@3: Loop ; Add more Loops here if you
Loop ; like, but adjust the #defines
Loop ; above.
@4: subi.l #TotalSize, COUNT
bge @3
; The remaining block is now
; less than TotalSize, so now
; zero as much as we can
; in blocks of 16.
addi.l #TotalSize, COUNT
bra @6
@5: movem.l d1-d4, -(PTR)
@6: subi.l #16, COUNT
bge @5
; The remaining block is now
; less than 16, so now
; use byte-zeroing.
addi.l #16, COUNT
bra @8
@7: move.b d2, -(PTR)
@8: dbra COUNT, @7
; Restore nuked registers.
movem.l (sp)+,d0-d7/a0-a5
}
} /* BlockZero */
/* eof BlockZero.c */
- --
______________________________________________________________________________
mike gleason mgleason@cse.unl.edu NCEMRSoft, baby!
+++++++++++++++++++++++++++
From: d88-jwa@musta.nada.kth.se (Jon Wtte)
Date: 18 Sep 92 09:59:59 GMT
Organization: Royal Institute of Technology, Stockholm, Sweden
> mgleason@cse.unl.edu (Mike Gleason) writes:
I tried this for the main block, but if I recall, dbra's loop counter
was acting like a 16-bit word instead of a nice big longword.
Well, using the SWAP function you can use DBRA as a 32-bit counter.
Still beats the MS-DOS programming model :-)
- --
Jon W{tte, h+@nada.kth.se, Sweden, Phone +46-8-107069
Help eradicate FIDO-Net <-> Usenet gateways in our time!
+++++++++++++++++++++++++++
From: Bruce.Hoult@bbs.actrix.gen.nz
Date: 18 Sep 92 07:55:21 GMT
Organization: Actrix Information Exchange
In article <1992Sep17.182911.27693@unlinfo.unl.edu> mgleason@cse.unl.edu (Mike Gleason) writes:
> Bruce.Hoult@bbs.actrix.gen.nz writes:
>
> >You should use dbra for the loop control (only two bytes longer than your
> >version and much faster)...
>
> I tried this for the main block, but if I recall, dbra's loop counter
> was acting like a 16-bit word instead of a nice big longword.
Correct -- that's why the code I showed used *two* DBRA's in nested
loops, with the tricky little SWAPs to get the high word of the count
register into the low word for the outer DBRA and back again. It's a
little bit ugly but correctly handles a 32-bit count, is only 2 bytes
bigger than what you were doing, and the extra SWAPs only get executed
once every 65536 loops so their overhead is almost zero. All of which
make this method (thanks to Lawrence for originally pointing the
method out!) overall tidier than the obvious nested DBRA method of
putting 16-bit counts in two different registers -- which has a little
less overhead every 65536 loops, but uses at minimum an extra MOVE.L
and SWAP in the loop prologue and uses an extra precious register,
for the same size code.
Isn't it amazing how much fun saving a couple of bytes or cycles can
be? ;-)
- --
Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
"Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
hard disk that fits in your pocket!" "Great! Is it PC compatible?"
+++++++++++++++++++++++++++
From: wombat@claris.com (Scott Lindsey)
Date: 18 Sep 92 09:18:30 GMT
Organization: Claris Corporation, Vancouver WA
In article <keith-160992104732@kip-58.taligent.com>, keith@taligent.com (Keith Rollin) writes:
>
> On the other hand, I don't
> know what the timing differences between MOVE.L and CLR.L are, so perhaps a
Well, according to the 5th edition of the M68000 Programmer's Reference Manual
from Motorola,
MOVE.L #<data>,(An)+ 20(3/2)
CLR.L (An)+ 24(2/4) + 16(4/0)
where the first number is clock periods, and the 2 parenthesized are bus read &
write cycles. the 16(4/0) is the time required for the (An)+ effective address
calculation. So on a 68000, the MOVE beats the CLR.
The manual also includes timings for the 68010. The MOVE timing is same,
but the CLR timing is 12(1/2) for (An)+ with no additional time for address
calculation. So CLR is the winner. I don't have timings for 020, 030, 040's,
but one would expect them to be more like the 010 than the 000. Note that the
010 also has a "loop mode" where opcode fetches are suppressed, speeding up
the loop even more.
So I would guess that if you're optimizing for the 68000 (which needs it the
most) then use MOVE, but use CLR if you're optimizing for > 68000.
- --
Scott Lindsey <wombat@claris.com>
QUIT
Scott Lindsey <wombat@claris.com>
+++++++++++++++++++++++++++
From: peterc@cubetech.com (Peter Creath)
Date: Thu, 17 Sep 92 22:25:37 CDT
Organization: Cube Technologies
Here's my optimization on Mike Gleason's BlockZero.c snippet:
(I'm gonna remove the comments within the code and explain it here...)
This is shorter, primarily because the code itself (not the declarations)
is entirely in assembly. Therefore, it can be used by any Pascal
or C compiler with an inline assembler.
This is also slightly faster and smaller, since I killed possible
inefficiencies in the compiled code. The original routine was designed
for speed, not for size, so I added an option that compiles a MUCH
smaller (but slower) version:
#define FAST /* leave undefined for the SMALL version */
#define ZERO(a) BlockZero(&(a), sizeof (a))
{
HParamBlockRec pb;
ZERO(pb);
}
void BlockZero(void *area, register unsigned long n);
void BlockZero(void *area, register unsigned long n)
{
register char *p = (char *)area;
register unsigned long temp;
register unsigned long zero;
#ifdef FAST
asm {
tst.l n
beq.s @skipit /* only if len > 0 */
move.w p,temp
andi.w #0x01,temp /* if odd offset, */
beq.s @skipeven
clr.b (p)+ /* clear odd byte and */
subq.l #0x01,n /* make offset even */
@skipeven:
move.l n,temp
asr.l #0x02,temp /* len/4 (# of longs in area) */
bra.s @fourloop
@clrloop: /* clear out longs speedily */
clr.l (p)+
@fourloop:
dbf temp,@clrloop
@nofours:
move.b n,temp
andi.b #0x03,temp /* check remainder (len % 4) */
bra.s @oddloop
@clearodd:
clr.b (p)+ /* clear remaining bytes */
@oddloop:
dbf temp, @clearodd
@skipit:
}
#else /* if SMALL */
asm {
move.l n,temp
bra.s @loopback
@loop:
clr.b (p)+
@loopback:
dbf temp,@loop
}
#endif
} /* BlockZero */
/* eof BlockZero.c */
- ----------------------------------------------------------------------------
Peter Creath "When I was a boy I was told that anybody could
peterc@cubetech.com become president; I'm beginning to believe it."
-- Clarence Darrow
+++++++++++++++++++++++++++
From: mgleason@cse.unl.edu (Mike Gleason)
Date: 21 Sep 92 03:10:42 GMT
Organization: NCEMRSoft
peterc@cubetech.com (Peter Creath) writes:
>Lines: 75
>Here's my optimization on Mike Gleason's [original] BlockZero.c snippet:
>(I'm gonna remove the comments within the code and explain it here...)
>This is shorter, primarily because the code itself (not the declarations)
>is entirely in assembly. Therefore, it can be used by any Pascal
>or C compiler with an inline assembler.
>This is also slightly faster and smaller, since I killed possible
>inefficiencies in the compiled code. The original routine was designed
>for speed, not for size, so I added an option that compiles a MUCH
>smaller (but slower) version:
Thanks for the improvements; unfortunately they were made to the first
version. The second one I posted is over 3 times faster. It compiles
to about 146 bytes (as opposed to 84 bytes for version 1), so you may
want to re-optimize it for size if the extra space is bothersome.
- --mg
p.s., here's the benchmarks of my functions. Please don't make fun of my
slow 7.x MHz CPU :-)
Function 0 is Symantec's memset;
1 is my first BlockZero;
2 is the new BlockZero.
3 is Michael Hecht's wzbzeri;
Function 0: 7 bytes in 89 ticks 59.91 KB/sec
Function 1: 7 bytes in 112 ticks 47.61 KB/sec
Function 2: 7 bytes in 162 ticks 32.91 KB/sec
Function 3: 7 bytes in 170 ticks 31.36 KB/sec
Function 0: 99 bytes in 464 ticks 162.52 KB/sec
Function 1: 99 bytes in 216 ticks 349.12 KB/sec
Function 2: 99 bytes in 194 ticks 388.71 KB/sec
Function 3: 99 bytes in 261 ticks 288.93 KB/sec
Function 0: 335 bytes in 1423 ticks 179.32 KB/sec
Function 1: 335 bytes in 495 ticks 515.51 KB/sec
Function 2: 335 bytes in 204 ticks 1250.86 KB/sec
Function 3: 335 bytes in 460 ticks 554.73 KB/sec
Function 0: 2048 bytes in 8154 ticks 191.32 KB/sec
Function 1: 2048 bytes in 2499 ticks 624.25 KB/sec
Function 2: 2048 bytes in 910 ticks 1714.29 KB/sec
Function 3: 2048 bytes in 1291 ticks 1208.37 KB/sec
Function 0: 308294 bytes in 1243151 ticks 188.90 KB/sec
Function 1: 308294 bytes in 358464 ticks 655.11 KB/sec
Function 2: 308294 bytes in 114221 ticks 2055.96 KB/sec
Function 3: 308294 bytes in 170072 ticks 1380.79 KB/sec
- --
- --mg mgleason@cse.unl.edu
+++++++++++++++++++++++++++
From: keith@taligent.com (Keith Rollin)
Date: 21 Sep 92 18:17:57 GMT
Organization: Taligent
In article <1992Sep21.031042.22722@unlinfo.unl.edu>, mgleason@cse.unl.edu
(Mike Gleason) wrote:
>
> p.s., here's the benchmarks of my functions. Please don't make fun of my
> slow 7.x MHz CPU :-)
>
> Function 0 is Symantec's memset;
> 1 is my first BlockZero;
> 2 is the new BlockZero.
> 3 is Michael Hecht's wzbzeri;
>
> Function 2: 7 bytes in 162 ticks 32.91 KB/sec
> Function 2: 99 bytes in 194 ticks 388.71 KB/sec
> Function 2: 335 bytes in 204 ticks 1250.86 KB/sec
>
I'm sorry, but I gotta make fun of Mike's slow 7.x MHz CPU.
NEARLY 3 SECONDS TO CLEAR 7 BYTES???
Also, what's the deal with it taking 162 "units" to clear 7 bytes, and only
20% longer to clear more than 12 times that much? Or 194 units to clear 99
bytes and only 10 units more to clear over 3 times that much? The
overhead's gotta be killer.
And the math doesn't work... (7 bytes / 162 ticks) * (60 ticks / 1 second)
comes to .00259 KB/second. You'd have to assume Mike ran 12,693 iterations
in order to get the tickcount he did on that first test.
- -----
Keith Rollin
Phantom Programmer
Taligent, Inc.
+++++++++++++++++++++++++++
From: mgleason@cse.unl.edu (Mike Gleason)
Date: 21 Sep 92 22:36:09 GMT
Organization: NCEMRSoft
keith@taligent.com (Keith Rollin) writes:
>In article <1992Sep21.031042.22722@unlinfo.unl.edu>, mgleason@cse.unl.edu
>(Mike Gleason) wrote:
>>
>> p.s., here's the benchmarks of my functions. Please don't make fun of my
>> slow 7.x MHz CPU :-)
>>
>> Function 0 is Symantec's memset;
>> 1 is my first BlockZero;
>> 2 is the new BlockZero.
>> 3 is Michael Hecht's wzbzeri;
>>
>> Function 2: 7 bytes in 162 ticks 32.91 KB/sec
>> Function 2: 99 bytes in 194 ticks 388.71 KB/sec
>> Function 2: 335 bytes in 204 ticks 1250.86 KB/sec
>>
>I'm sorry, but I gotta make fun of Mike's slow 7.x MHz CPU.
>NEARLY 3 SECONDS TO CLEAR 7 BYTES???
:-) I evidently forgot to mention that I was using VIA_ticks() that comes
with Symantec's ANSI library. According to them there are "approximately
780,000" per second. Originally my test suite was using system Ticks, but
60 units per second isn't enough so I decided to try using the VIA timer
that Symantec's profiler uses.
>Also, what's the deal with it taking 162 "units" to clear 7 bytes, and only
>20% longer to clear more than 12 times that much? Or 194 units to clear 99
>bytes and only 10 units more to clear over 3 times that much? The
>overhead's gotta be killer.
There is a bit of overhead. Here's a bit of pseudo-code to illustrate it:
(Did the second version of the code get through? You apparently haven't
seen it and I have had an email request. I guess I will just append it
again at the end of this message; sorry if this is wasting bandwidth.)
make a copy of 'n' for the counter;
make a copy of the pointer to the area to be zeroed;
align our ptr on a longword boundary;
if (n >= 144) then
zero 144 at a time, subtract amount zeroed from n;
if (n >= 16) then
zero 16 a time, subtract amount zeroed from n;
if (n >= 1) then
zero 1 at a time.
If you look at the results for Symantec's memset(), which is just one byte
at a time, it doesn't do much better for small sizes, despite having little
overhead. It just looks like mine has a lot of overhead because it does
so much better on big blocks.
Function 0: (memset) 7 bytes in 89 ticks 59.91 KB/sec
Function 0: 99 bytes in 464 ticks 162.52 KB/sec
Function 0: 335 bytes in 1423 ticks 179.32 KB/sec
- --mg
(Version 2 of BlockZero follows...)
/* BlockZero.c */
/* Handy macro: */
#define ZERO(block,size) BlockZero(&(block), (long) (size))
void BlockZero(void *area, long n);
#define PTR a0 /* Our copy of 'area.' */
#define COUNT d0 /* Our copy of 'n.' */
/*
* Using registers d1 - d7, and a1 - a5 to hold zeroes for use
* with MOVEM. Could use a6 also if you can trick the inline
* assembler to not use LINK/UNLK.
*/
#define nAddrRegsUsed 5
#define nDataRegsUsed 7
/*
* One MOVEM instruction will clear this many bytes at a time.
*/
#define ZSize ((nAddrRegsUsed + nDataRegsUsed) * sizeof (long))
/*
* This is how many MOVEM instructions you want in your loop. For each
* extra one you add, then large blocks will clear faster; however
* smaller blocks will be slower because it will use the intermediate
* loop, where only 16 bytes are cleared at a time.
*/
#define UnrolledLoops 3
/*
* For each UnrolledLoop, do one of these.
*/
#define Loop movem.l d1-d7/a1-a5, -(PTR)
/*
* This is how many bytes are cleared by all of your MOVEMs before
* we have to decrement the count and start another loop iteration.
*/
#define TotalSize (ZSize * UnrolledLoops)
void BlockZero(void *area, long n)
{
asm {
movem.l d0-d7/a0-a5, -(sp)
move.l n, COUNT ; Could use index from the SP
movea.l area, PTR ; for these instead of a6.
add.l COUNT, PTR ; Start at end of block,
; and go backwards.
move.w PTR, d1 ; Find (area + n) % 4.
andi.w #3, d1
sub.l d1, COUNT ; We will byte-zero this
; many bytes to longword align,
; so subtract this from the
; total.
moveq.l #0, d2
bra @2 ; Zero the last 3 or so
@1: move.b d2, -(PTR) ; bytes to align PTR on a
@2: dbra d1, @1 ; longword boundary.
; Clear out as many registers
; as possible to use with MOVEM.
move.l d2, d1 ; Don't need d1 for scratch.
move.l d2, d3
move.l d2, d4
cmpi.l #TotalSize, COUNT
bls @6 ; If it is a smaller block, then
; jump to the section that clears
; 16 at a time.
; Otherwise we'll need to clear
; out the rest of these registers
; to use in a huge MOVEM.
move.l d2, d5
move.l d2, d6
move.l d2, d7
movea.l d2, a1
movea.l a1, a2
movea.l a1, a3
movea.l a1, a4
movea.l a1, a5
; Now zero as much as we can
; in blocks of 144 (3 movem's
; of 48 bytes each).
bra @4
@3: Loop ; Add more Loops here if you
Loop ; like, but adjust the #defines
Loop ; above.
@4: subi.l #TotalSize, COUNT
bge @3
; The remaining block is now
; less than TotalSize, so now
; zero as much as we can
; in blocks of 16.
addi.l #TotalSize, COUNT
bra @6
@5: movem.l d1-d4, -(PTR)
@6: subi.l #16, COUNT
bge @5
; The remaining block is now
; less than 16, so now
; use byte-zeroing.
addi.l #16, COUNT
bra @8
@7: move.b d2, -(PTR)
@8: dbra COUNT, @7
; Restore nuked registers.
movem.l (sp)+,d0-d7/a0-a5
}
} /* BlockZero */
/* eof BlockZero.c */
- --
/*
* Mike Gleason, NCEMRSoft Worldwide Development. mgleason@cse.unl.edu
* Patriots fan, and damn proud of it! "Bienvenido a Macintosh" --7.0e1
*/
+++++++++++++++++++++++++++
From: wombat@claris.com (Scott Lindsey)
Date: 21 Sep 92 23:13:23 GMT
Organization: Claris Corporation, Vancouver WA
In article <D88-JWA.92Sep18144437@dront.nada.kth.se>, d88-jwa@dront.nada.kth.se (Jon Wtte) writes:
>
>
> How about MOVEQ.L?
>
On the 68000, MOVEQ can only move 8 bits into a data register; the 8 bits are
sign-extended into a long.
MOVEQ's timing is 8(2/0); 4(1/0) on an '010.
- --
Scott Lindsey <wombat@claris.com>
+++++++++++++++++++++++++++
From: ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University)
Date: 23 Sep 92 06:08:56 GMT
Organization: University of Waikato, Hamilton, New Zealand
In article <1992Sep21.223609.4608@unlinfo.unl.edu>, mgleason@cse.unl.edu (Mike Gleason) writes:
> :-) I evidently forgot to mention that I was using VIA_ticks() that comes
> with Symantec's ANSI library. According to them there are "approximately
> 780,000" per second. Originally my test suite was using system Ticks, but
> 60 units per second isn't enough so I decided to try using the VIA timer
> that Symantec's profiler uses.
Why not just use the Time Manager? You can get microsecond accuracy (none
of this "approximately 780,000" business). Details in IM6.
Lawrence D'Oliveiro fone: +64-7-856-2889
Computer Services Dept fax: +64-7-838-4066
University of Waikato electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
---------------------------
End of C.S.M.P. Digest
**********************